First, we load the tidyverse R packages.

Motivation

Let’s say you have a set of \(n\) observations \(X_i = (X_{i1},...,X_{ip})\) for \(i = 1, \ldots, n\), each with \(p\) features. This data may or may not be “high-dimensional” (i.e. \(p \gg n\)). Depending on your goal in the analysis, you will find dimensionality reduction methods can be useful in some of the following scenarios (not mutually exclusive):

  1. Exploratory data analysis

As we have seen in class, visualizing data is one of the most important parts of data science. The right visualization may reveal problems with the data that can render the results of an otherwise appropriate standard analysis completely useless. It can also help us make important discoveries.

For example, there are methods for visualizing univariate and paired data, but visualizing relationships between columns or between rows becomes complicated when the data is high-dimensional. If each observation has \(p\)=100 features, comparing each pair of features would require \({p \choose 2} = 4950\) scatter plots, and creating one single scatter plot of the data is not possible. High dimensionality also raises questions that pairwise views cannot answer (e.g. what if there are strong correlations between the features?).

Methods for dimension reduction can be used to preserve important characteristics of the data, such as the distances between features or observations, but with fewer dimensions, making data visualization (and exploratory data analysis generally) feasible.

We will consider different methods for dimension reduction, but the main approach we will discuss is called Principal Components Analysis. The goal here is to find a new set of multivariate variables that are uncorrelated and explain as much variance as possible.

For our purposes, EDA is the primary reason we will be using dimensionality reduction methods for this lecture.

  2. Data Compression

Here we are interested in finding the best matrix created with fewer variables (lower rank) that explains the original data.

In the figure below, we see that using only the “first principal component” really just captures the difference between light and dark. As you increase the number of principal components (PCs), you see more detail in the image. You can think of the number of PCs as controlling how much data compression to apply: a small number of PCs results in a large amount of data compression, while a large number of PCs performs very little data compression.

If you go far enough, you will recover an image that looks very similar to the original, but that is of a smaller dimension than the original dataset.

[image source]

  3. Improve classification (hopefully)

We will not get to this until term 2, so stay tuned!

Motivating example of twin heights

Let’s assume we are interested in measuring the heights of twins. For example, consider the identical twin astronauts, Scott and Mark Kelly, who were subjects of NASA’s Twins Study. Scott (right) spent a year in space while Mark (left) stayed on Earth as a control subject. NASA was interested in looking at the effects of space travel on the human body.

[image source]

For our purpose, let’s assume we are only interested in their heights, so we gather data for \(n\)=100 pairs of twins. So, for class, we can simulate \(n\)=100 two-dimensional (or \(p\)=2 features) points that represent heights of a pair of twins.

Ok, let’s simulate the heights of pairs of twins.
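The original code chunk is not shown in these notes; here is one way the simulation might look in base R. The mean of 69 inches, the standard deviations, and the seed are assumptions, so the numbers you get will differ from the printed output below.

```r
set.seed(1)
n <- 100
# a shared component induces a strong correlation between the twins' heights
shared <- rnorm(n, mean = 69, sd = 3)
X <- cbind(twin_height_1 = shared + rnorm(n, sd = 0.5),
           twin_height_2 = shared + rnorm(n, sd = 0.5))
head(X)
```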

##      twin_height_1 twin_height_2
## [1,]      67.72362      67.32411
## [2,]      68.56875      70.20449
## [3,]      69.04952      68.48654
## [4,]      71.10088      72.11233
## [5,]      70.21862      68.46903
## [6,]      70.17676      69.69639

For our purposes let’s say we assume the number of features (\(p\)=2) is too large (or too high-dimensional) and we want to reduce the dimensions to \(p\)=1 (I know this might seem ridiculous, but please stay with me for a few minutes).

Here we will motivate methods for dimensionality reduction by showing a transformation which permits us to approximate the distance between two dimensional points with just one dimension.

Consider the first two observations (i.e. first rows) in our dataset X (red points).

The distance between those two points is

## [1] 3.001808

Or we can calculate the Euclidean distance between all observations pairwise using the dist() function.
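A sketch of these computations (we re-simulate the heights so the chunk runs on its own; the simulation setup is an assumption, so the numbers will differ from the output shown):

```r
set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))

# Euclidean distance between the first two observations
d12 <- sqrt(sum((X[1, ] - X[2, ])^2))

# all pairwise Euclidean distances, shown here for the first five observations
D <- as.matrix(dist(X[1:5, ]))
D
```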

##          2        3        4        5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000

Note, if we center the data by removing the average from both columns, the distances between the pairs of twins do not change.

Here we use the scale() function, which has center and scale arguments, to remove the mean from each column (with scale = FALSE, the columns are centered but not divided by their standard deviations). You can also try the sweep() function for just centering or just scaling.
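A self-contained sketch of the centering step (the simulated heights are a stand-in for the data above), verifying that pairwise distances are unchanged:

```r
set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))

# center each column; scale = FALSE means we subtract the mean only
X_centered <- scale(X, center = TRUE, scale = FALSE)

# the pairwise distances are unchanged by centering
max(abs(dist(X) - dist(X_centered)))
```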

##      twin_height_1 twin_height_2
## [1,]   -1.27826065    -1.6911377
## [2,]   -0.43312378     1.1892437
## [3,]    0.04764383    -0.5287020
## [4,]    2.09900233     3.0970841
## [5,]    1.21674350    -0.5462180
## [6,]    1.17488426     0.6811482
##          2        3        4        5
## 1 3.001808 1.763315 5.859435 2.745157
## 2 0.000000 1.783949 3.170413 2.394554
## 3 1.783949 0.000000 4.165861 1.169231
## 4 3.170413 4.165861 0.000000 3.748604
## 5 2.394554 1.169231 3.748604 0.000000

Ok, let’s start with the naive approach of simply removing one of the two dimensions. Let’s compare the actual distances to the distance computed with just one of the dimensions. The plot below shows the comparison to the first dimension (left) and to the second (right).

We see the actual distance is generally underestimated using only one of the two columns of heights. This is to be expected, since the actual distance sums contributions from both dimensions. If we instead average across the dimensions and use this distance,

\[d_{ij} = \sqrt{ \frac{1}{2} \sum_{p=1}^2 (X_{i,p}-X_{j,p})^2 }\] then we see the bias goes away
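This comparison can be sketched in code (re-simulating the heights under the assumed setup; the exact numbers will differ from those shown):

```r
set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))

d_full <- dist(X)            # actual two-dimensional distance
d_one  <- dist(X[, 1])       # naive: first dimension only (underestimates)
d_avg  <- d_full / sqrt(2)   # the averaged distance from the formula above

# the naive distance is biased low against the full distance,
# but roughly unbiased against the averaged distance
mean(as.numeric(d_one) / as.numeric(d_full))
mean(as.numeric(d_one) / as.numeric(d_avg))
```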

We also see there is a strong correlation.

## [1] 0.97286

Can we pick a one dimensional summary that makes this correlation even stronger?

If we look back at the plot and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to run along the direction of the diagonal. Notice what happens if we instead plot the average of the two heights against their difference:

We see almost all the variation can be explained by the Mean axis. This means that we can essentially ignore the second dimension and not lose too much information. If the line is completely flat, we lose no information. If we use this transformation of the data instead we get much higher correlation:

## [1] 0.9973466

Note that each row of \(X\) was transformed using a linear transformation.

If you are familiar with linear algebra we can write the operation we just performed like this:

\[ Z = X A \mbox{ with } A = \, \begin{pmatrix} 1/2&1\\ 1/2&-1\\ \end{pmatrix} \]

And that we can transform back by simply multiplying by \(A^{-1}\) as follows:

\[ X = Z A^{-1} \mbox{ with } A^{-1} = \, \begin{pmatrix} 1&1\\ 1/2&-1/2\\ \end{pmatrix} \]
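In code, the transformation and its inverse look like this (the simulated heights are a stand-in for the data above):

```r
# columns of A: average of the two heights, and their difference
A <- matrix(c(1/2, 1/2, 1, -1), nrow = 2)

set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))

Z <- X %*% A               # first column: mean height; second: difference
X_back <- Z %*% solve(A)   # transform back with A^{-1}
max(abs(X - X_back))       # recovers X
```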

Note: we can actually guarantee that the distances remain exactly the same if we re-scale the columns of \(A\) so that the sum of squares of each column is 1:

\[a_{1,1}^2 + a_{2,1}^2 = 1\mbox{ and } a_{1,2}^2 + a_{2,2}^2=1\]

and the correlation of the columns is 0. In this particular example, to achieve this we multiply the first set of coefficients (first column of \(A\)) by \(\sqrt{2}\) and the second by \(1/\sqrt{2}\). Then we get the exact same distances if we use both dimensions, and a very good approximation if we use just the first.

In this case \(Z\) is called an orthogonal rotation of \(X\): it preserves the distances between points.
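We can check the orthogonality and the distance-preserving property directly (again re-simulating the heights as a stand-in):

```r
A <- matrix(c(1/2, 1/2, 1, -1), nrow = 2)
# re-scale the columns so each has sum of squares equal to 1
A_orth <- A %*% diag(c(sqrt(2), 1 / sqrt(2)))

set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))
Z <- X %*% A_orth

# A_orth is orthogonal, so all pairwise distances are preserved exactly
max(abs(dist(X) - dist(Z)))
```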

Let’s formalize this a bit more.

Singular Value Decomposition (SVD)

The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.

The main result SVD provides is that we can write any matrix \(\mathbf{X}\) (of dimension \(m \times n\)) as

\[\mathbf{U}^\top\mathbf{X} = \mathbf{DV}^\top\]

With:

  • \(\mathbf{U}\) is an \(m \times p\) orthogonal matrix (left singular vectors)
  • \(\mathbf{V}\) is an \(n \times p\) orthogonal matrix (right singular vectors)
  • \(\mathbf{D}\) is a \(p \times p\) diagonal matrix (singular values)

with \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{X}\) that turns out to be very useful because the variability of the columns of \(\mathbf{U}^\top \mathbf{X}\) (or \(\mathbf{DV}^\top\)) are decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:

\[\mathbf{X} = \mathbf{UDV}^\top\]

Now we will apply a SVD to the motivating example we have using the svd() function in R:
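The chunk is not shown; a self-contained sketch (re-simulating the twin heights, an assumption, so the numbers will differ from the outputs printed below):

```r
set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(shared + rnorm(100, sd = 0.5), shared + rnorm(100, sd = 0.5))

# center the columns, then transpose so features are along rows (p x n)
Y <- t(scale(X, center = TRUE, scale = FALSE))
dim(Y)

s <- svd(Y)
str(s)

# the decomposition reconstructs Y exactly (up to numerical error)
max(abs(Y - s$u %*% diag(s$d) %*% t(s$v)))
```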

## [1]   2 100

Note, we transposed X to put the (p=2) features along the rows, with the n=100 observations along the columns, before applying svd().

The svd() function returns the three matrices; for \(\mathbf{D}\), only the diagonal entries are returned.

## List of 3
##  $ d: num [1:2] 13.9 2.19
##  $ u: num [1:2, 1:2] -0.707 -0.707 -0.707 0.707
##  $ v: num [1:100, 1:2] 0.0499 -0.01324 0.00826 -0.08742 -0.01062 ...

To obtain the principal components from the SVD, we simply need to compute the rotation \(\mathbf{U}^\top\mathbf{X}\) (with features along the rows, each row of the result is a principal component):

## [1]   2 100

We can plot the \(n=100\) twin heights along the principal components

Alternatively, we could have used \(\mathbf{DV}^\top\):

Another way of calculating the PCs is to use the prcomp() function.
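A sketch of the call (again on re-simulated stand-in heights, so the printed structure below will differ in its numbers):

```r
set.seed(1)
shared <- rnorm(100, mean = 69, sd = 3)
X <- cbind(twin_height_1 = shared + rnorm(100, sd = 0.5),
           twin_height_2 = shared + rnorm(100, sd = 0.5))

# prcomp centers by default; scale. = TRUE also divides by the column sd
pc <- prcomp(X, center = TRUE, scale. = TRUE)
str(pc)

head(pc$x)   # the principal components (scores)
```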

## List of 5
##  $ sdev    : num [1:2] 1.4 0.22
##  $ rotation: num [1:2, 1:2] 0.707 0.707 0.707 -0.707
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "twin_height_1" "twin_height_2"
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ center  : Named num [1:2] -8.40e-18 1.97e-17
##   ..- attr(*, "names")= chr [1:2] "twin_height_1" "twin_height_2"
##  $ scale   : Named num [1:2] 3.1 2.97
##   ..- attr(*, "names")= chr [1:2] "twin_height_1" "twin_height_2"
##  $ x       : num [1:100, 1:2] -0.694 0.184 -0.115 1.215 0.148 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  - attr(*, "class")= chr "prcomp"

The output pc$x contains the principal components

The PCs are the same as the right singular vectors when the data are centered and scaled (the signs of the PCs are arbitrary).

Cool! So we have now reduced the \(p=2\) dimensions to 1 dimension that explains the most amount of variation.

OK, but how is this useful?

I will admit in our super simple example, it is not immediately obvious how incredibly useful the SVD can be. So let’s consider an example with more than two features. In this example, we will greatly reduce the dimension of \(\mathbf{V}\) and still be able to reconstruct \(\mathbf{X}\).

But first, in case you were curious: the singular value decomposition is widely used in industry, so it’s worth understanding well.

The singular value decomposition is very general in the sense that it can be applied to any m × n matrix whereas eigenvalue decomposition can only be applied to diagonalizable matrices. Nevertheless, the two decompositions are related. See this wikipedia link for more info.

Motivating example with the Written Digits

When you write and mail a letter, the first thing that happens when it arrives at the post office is that it gets sorted by zip code:

Originally humans had to sort these but today, thanks to machine learning algorithms, a computer can read zip codes. The data is available in the data repository on the course github page.

## # A tibble: 5 x 5
##   label pixel0 pixel1 pixel2 pixel3
##   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1     1      0      0      0      0
## 2     0      0      0      0      0
## 3     1      0      0      0      0
## 4     4      0      0      0      0
## 5     0      0      0      0      0
## [1] 5000  784

Here are three images of written digits.

What are the features?

Each image is converted into \(28 \times 28\) pixels, and for each pixel we obtain a grey scale intensity from 0 (white) to 255 (black). This means one image has 784 (= 28 × 28) features.

We can see these values like this:

What are the outcomes?

So for each digit \(i\) we have an outcome \(Y_i\) which can be one of 10 categories: \(0,1,2,3,4,5,6,7,8,9\) and the features \(X_{i,1}, \dots, X_{i,784}\) which can take values from 0 to 255. We use bold face to denote this vector of predictors \(\mathbf{X}_i = (X_{i,1}, \dots, X_{i,784})\).

784 features is a lot of features. I’m going to assume pixels close together are going to be somewhat correlated. Is there any room for data reduction? Can we identify a reduced set of features that can explain the variation in the data?

Calculating the top PCs

If you recall, the first PC will explain the most variation in the data, the second PC will explain the second most variation, etc.

Because the pixels are so small, we expect those that are close to each other on the grid to be correlated, meaning that dimension reduction should be possible.

Let’s take the SVD of \(\mathbf{X}\). We only read in 5000 observations, so this shouldn’t take very long, but if you do this on the entire dataset, this will take a little while.

## [1] 5000  784

Remember, we need to column-center the data. We will also create a new variable \(\mathbf{Y}\) to represent the centered data, transposed so that the features are along the rows.

## [1]  784 5000

Now apply the svd() function to \(\mathbf{Y}\).

## List of 3
##  $ d: num [1:784] 41094 35450 32720 30313 28936 ...
##  $ u: num [1:784, 1:784] 2.22e-16 -1.11e-16 -1.67e-16 1.39e-17 -6.94e-18 ...
##  $ v: num [1:5000, 1:784] -0.031526 0.000418 -0.004293 -0.00023 -0.000741 ...

First note that we can in fact reconstruct \(\mathbf{Y}\) using all the PCs:
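The digits matrix itself is not loaded here, so this sketch uses a small random stand-in (the real \(\mathbf{Y}\) is 784 × 5000); the reconstruction step is the same:

```r
# stand-in for the digits matrix, just to illustrate the reconstruction
set.seed(1)
Y <- matrix(rnorm(50 * 20), nrow = 50)

s <- svd(Y)
Yhat <- s$u %*% diag(s$d) %*% t(s$v)   # reconstruct Y from all PCs
max(abs(Y - Yhat))                     # essentially zero
```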

## [1] 3.47228e-10

If we look at the singular values in \(\mathbf{D}\), we see that the last few are quite close to 0.

This implies that the last columns of \(\mathbf{V}\) have a very small effect on the reconstruction of \(\mathbf{X}\). To see this, consider the extreme example in which the last entry of \(\mathbf{D}\) is 0. In this case, the last column of \(\mathbf{V}\) is not needed at all.

Because of the way the SVD is constructed, the columns of \(\mathbf{V}\) have less and less influence on the reconstruction of \(\mathbf{X}\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”.

As an example, we will look at what happens if we remove the last 100 columns:
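A sketch of the truncated reconstruction; since the digits data is not loaded here, we use a stand-in matrix with approximate low-rank structure (an assumption), which is what makes dropping columns nearly lossless:

```r
# stand-in: rank-5 signal plus small noise (the real matrix is 784 x 5000)
set.seed(1)
Y <- matrix(rnorm(100 * 5), 100) %*% matrix(rnorm(5 * 60), 5) +
     matrix(rnorm(100 * 60, sd = 0.01), 100)

s <- svd(Y)
k <- ncol(Y) - 30                     # drop the last 30 columns
Yhat <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
max(abs(Y - Yhat))                    # largest residual is small
```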

## [1] 3.47228e-10

The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 100 fewer dimensions to transmit the information.

By looking at \(\mathbf{D}\), we can see that, in this particular dataset, we can obtain a good approximation keeping only a subset of columns. The following plots are useful for seeing how much of the variability is explained by each column:

We can also make a cumulative plot:

Although we start with 784 dimensions, we can approximate \(X\) with just a few:

Therefore, by using only 100 dimensions, we retain most of the variability in our data:

## [1] 0.9173817

We say that we explain 92 percent of the variability in our data with 100 PCs.

Note that we can compute this proportion from \(\mathbf{D}\):
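A sketch of that computation (on the same kind of low-rank stand-in matrix as above, since the digits data is not loaded here):

```r
# stand-in with rank-5 signal plus small noise
set.seed(1)
Y <- matrix(rnorm(100 * 5), 100) %*% matrix(rnorm(5 * 60), 5) +
     matrix(rnorm(100 * 60, sd = 0.01), 100)
s <- svd(Y)

# proportion of total variability explained by the first k PCs
k <- 5
var_explained <- sum(s$d[1:k]^2) / sum(s$d^2)
var_explained
```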

## [1] 0.9173817

The entries of \(\mathbf{D}\) therefore tell us how much each PC contributes in terms of variability explained.

Another way of calculating the PCs is to use the prcomp() function.

The proportion of variance of the first ten PCs is quite high (almost 50%):

##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     581.21666 501.39563 462.77973 428.73238 409.25086
## Proportion of Variance   0.09811   0.07301   0.06220   0.05338   0.04864
## Cumulative Proportion    0.09811   0.17112   0.23332   0.28671   0.33535
##                              PC6       PC7       PC8       PC9      PC10
## Standard deviation     387.73924 331.29269 317.66005 307.02079 283.68683
## Proportion of Variance   0.04366   0.03188   0.02931   0.02738   0.02337
## Cumulative Proportion    0.37902   0.41089   0.44020   0.46758   0.49095

We can also plot the standard deviations:

or the more common plot variance explained:

We can also see that the first two PCs will in fact be quite informative. Here is a plot of the first two PCs, but colored by the labels that we ignored:

We can also “see” the linear combinations on the grid to get an idea of what is getting weighted:

Other methods of dimensionality reduction

Linear methods

Factor analysis

The main difference between PCA and factor analysis is that PCA identifies a linear combination of observed variables, while factor analysis is a measurement model of a latent variable.

This latent variable cannot be measured directly with a single variable (e.g. intelligence, social anxiety, soil health); instead, it is seen through the relationships it induces in a set of \(X\) variables. Factor analysis is best suited to situations with a highly correlated set of variables: it divides the variables into groups based on their correlations and represents each group with a factor, aiming to identify latent sources of variation. It is also typically tied to Gaussian distributions.

There are a ton of packages for factor analysis, but some you can try are the factor.pa() function in the psych R package, or the factanal() function in the stats R package.
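A minimal sketch using factanal() from the stats package; the data here are simulated from a single hypothetical latent factor (the loadings, sample size, and noise level are all assumptions for illustration):

```r
# simulate six observed variables driven by one latent factor
set.seed(1)
n <- 500
latent <- rnorm(n)
obs <- sapply(1:6, function(j) 0.8 * latent + rnorm(n, sd = 0.6))

# maximum-likelihood factor analysis with a single factor
fa <- factanal(obs, factors = 1)
fa$loadings   # each variable loads strongly on the one factor
```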

Independent Component Analysis (ICA)

While the goal in PCA is to find an orthogonal linear transformation that maximizes the variance of the variables, the goal of ICA is to find a linear transformation in which the basis vectors are statistically independent and non-Gaussian. Unlike PCA, the basis vectors in ICA are neither orthogonal nor ranked in order; all components are equally important. If you can think of your data as a mix of signals, then the ICA basis will have a vector for each independent signal.

You can try ICA using the fastICA R package.

Non-linear methods

If there are complex, non-linear (e.g. polynomial) relationships between your features, you might try non-linear dimensionality reduction methods.

Isometric feature mapping (ISOMAP)

We use this technique when the data is strongly non-linear.

You can try ISOMAP using the Isomap() function in the RDRToolbox R package.

t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE often provides better data visualization than PCA, because a linear method struggles to model curved manifolds. It focuses on preserving the distances between nearby data points rather than the distances between widely separated points.

You can try t-SNE using the Rtsne R package.

Pulling from this medium post, there is a nice explanation of t-SNE:

This technique is great at capturing the non-linear structure in high dimensional data, at least at a local level, meaning that if two points are close together in the high dimensional space, they have a high probability of being close together in the low dimensional embedding space.

If you would like more information on what t-SNE is, what it can show, and what it cannot show, I would recommend reading through these illustrative examples.

A few highlights:

The goal is to take a set of points in a high-dimensional space and find a faithful representation of those points in a lower-dimensional space, typically the 2D plane. The algorithm is non-linear and adapts to the underlying data, performing different transformations on different regions. Those differences can be a major source of confusion.

The perplexity parameter is a tunable parameter “which says (loosely) how to balance attention between local and global aspects of your data. The parameter is, in a sense, a guess about the number of close neighbors each point has.”

“Cluster sizes in a t-SNE plot mean nothing.” A main feature of t-SNE is to equalize the density of points: “it naturally expands dense clusters, and contracts sparse ones, evening out cluster sizes.”

“Distances between clusters might not mean anything.” This is highly dependent on the perplexity parameter.

Uniform Manifold Approximation and Projection (UMAP)

This technique works well for high-dimensional data, and its run-time is shorter compared to t-SNE.

You can try UMAP using the umap R package.

Pulling from this medium post, there is a nice summary / comparison of UMAP and t-SNE:

It’s the new kid on the dimensionality reduction block (in 2018), and it is very similar to t-SNE. If you compare visualizations created with t-SNE and UMAP, you might have a hard time telling them apart. However, UMAP appears to have some significant advantages over t-SNE:

  • It’s faster than t-SNE.
  • It captures global structure better than t-SNE.
  • Best of all, while t-SNE doesn’t have much use outside of visualization, UMAP is a general-purpose dimensionality reduction technique that can be used as preprocessing for machine learning.
  • UMAP also has a solid theoretical backing as a manifold approximation technique, whereas t-SNE is primarily a visualization heuristic.

The main disadvantage of UMAP is its lack of maturity. It is a very new technique, so the libraries and best practices are not yet firmly established or robust. However, if you’re willing to be an early adopter, UMAP has a lot to offer.

Resources and sources of material